Events listing, deletion and archival optimizations#11443

Open
winterhazel wants to merge 7 commits into apache:main from scclouds:listevents-optimizations

Conversation

@winterhazel
Member

Description

The listEvents and deleteEvents APIs execute some unoptimized queries, which results in very long response times in environments that have many events.

Some optimizations were performed in these two workflows to reduce response times. The listing now uses a covering index and no longer performs an unnecessary DISTINCT. The deletion is now performed in batches; the batch size is defined by the global setting delete.query.batch.size.
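
The batched deletion described above can be sketched as follows. This is an illustrative SQL fragment only, not the PR's actual statements; the filter and the batch size of 1000 are assumptions:

```sql
-- Hypothetical sketch of batched deletion, assuming delete.query.batch.size = 1000.
-- The server repeats this statement until a pass affects fewer than 1000 rows,
-- keeping each transaction (and its locks/undo) small instead of issuing one
-- huge DELETE over millions of rows.
DELETE FROM event
WHERE created < '2025-01-01'
LIMIT 1000;
```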

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

How Has This Been Tested?

Events listing

In a test environment with 2 million event entries, I called listEvents multiple times passing different combinations of account, archived, domainid, duration, enddate, entrytime, id, keyword, level, projectid, resourceid, resourcetype, startdate, startid and type. For almost all the listings, I got a more reasonable response time (between 0.1 and 3 seconds depending on the filters) compared to before (30+ seconds). The only exception is the keyword parameter, which cannot be optimized with an index because it performs a column LIKE "%keyword%" match against 3 different columns.
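
For illustration only (the actual index definition lives in the PR's upgrade code; the column list below is an assumption based on the filters mentioned), a covering index lets the listing be answered from the index alone, while a leading-wildcard LIKE cannot use it:

```sql
-- Hypothetical covering index on the event table; the real columns are
-- defined in the PR's Upgrade/DbUpgradeUtils changes.
CREATE INDEX i_event__covering ON event (account_id, archived, type, created);

-- Can use the index: equality/range predicates on indexed columns.
SELECT id FROM event
WHERE account_id = 2 AND archived = 0 AND created >= '2025-01-01';

-- Cannot use a B-tree index: '%keyword%' has a leading wildcard, so every
-- row must be examined, which is why the keyword filter remains slow.
SELECT id FROM event WHERE type LIKE '%keyword%';
```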

Access control

I called listEvents and verified that:

  • root admins can list all events;
  • domain admins can only list events from their domain;
  • users can only list their events.

Events deletion

  1. I assigned a value to delete.query.batch.size, enabling batch delete;
  2. I called deleteEvents to remove all events before 2025-01-01. Then, I verified that they were removed successfully;
  3. I assigned a value to event.purge.delay, enabling the automatic deletion of events older than 15 days. After the deletion task executed, I verified that events older than 15 days were removed successfully.
Access control

I called deleteEvents and verified that:

  • root admins can delete any events;
  • domain admins can only delete events from their domain;
  • users can only delete their events.

@winterhazel
Member Author

@blueorangutan package

@winterhazel winterhazel added this to the 4.20.2 milestone Aug 13, 2025
@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@codecov

codecov Bot commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 41.17647% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.08%. Comparing base (3166e64) to head (7012147).
⚠️ Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rc/main/java/com/cloud/event/dao/EventDaoImpl.java | 0.00% | 43 Missing ⚠️ |
| ...in/java/com/cloud/server/ManagementServerImpl.java | 78.04% | 3 Missing and 6 partials ⚠️ |
| ...stack/api/command/user/event/ArchiveEventsCmd.java | 57.14% | 6 Missing ⚠️ |
| ...dstack/api/command/user/event/DeleteEventsCmd.java | 61.53% | 5 Missing ⚠️ |
| ...ava/com/cloud/upgrade/dao/Upgrade42210to42300.java | 0.00% | 4 Missing ⚠️ |
| ...ain/java/com/cloud/upgrade/dao/DbUpgradeUtils.java | 0.00% | 3 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##               main   #11443      +/-   ##
============================================
+ Coverage     18.01%   18.08%   +0.06%
- Complexity    16607    16703      +96
============================================
  Files          6029     6036       +7
  Lines        542160   542433     +273
  Branches      66451    66421      -30
============================================
+ Hits          97682    98100     +418
+ Misses       433461   433313     -148
- Partials      11017    11020       +3
```
| Flag | Coverage Δ |
|---|---|
| uitests | 3.52% <ø> (ø) |
| unittests | 19.24% <41.17%> (+0.07%) ⬆️ |

Flags with carried forward coverage won't be shown.


@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14617

Contributor

Copilot AI left a comment


Pull Request Overview

This PR optimizes the performance of events listing and deletion operations to reduce response times in environments with large numbers of events. The optimizations include removing unnecessary DISTINCT operations, adding a covering index for better query performance, and implementing batch deletion.

  • Replaced DISTINCT with NATIVE function in event listing queries for better performance
  • Implemented batch deletion using the existing delete.query.batch.size configuration
  • Added a new covering index on the event table to optimize search queries

Reviewed Changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| ManagementServerImpl.java | Refactored deleteEvents method to use batch deletion and simplified access control logic |
| ConfigurationManagerImpl.java | Updated documentation for DELETE_QUERY_BATCH_SIZE to include events |
| QueryManagerImpl.java | Removed DISTINCT operation from event search query for performance |
| EventDaoImpl.java | Added new purgeAll method for batch deletion and removed unused listOlderEvents method |
| EventDao.java | Added purgeAll method interface and removed listOlderEvents method |
| DomainDaoImpl.java | Added getDomainAndChildrenIds convenience method |
| DomainDao.java | Added getDomainAndChildrenIds method interface |
| DeleteEventsCmd.java | Moved parameter validation from command to service layer and improved error message |
| ListEventsCmd.java | Removed unnecessary blank line |
| Upgrade42010to42020.java | Added database upgrade class to create covering index |
| DbUpgradeUtils.java | Added utility method for creating indexes with custom names |
| schema-42010to42020.sql | Empty schema upgrade file |
| schema-42010to42020-cleanup.sql | Empty schema cleanup file |


@winterhazel
Member Author

@blueorangutan package

@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 14637

Contributor

@shwstppr shwstppr left a comment


code looks good, will make the delete events action much better. Needs testing.

@winterhazel could we extend this to cover archive events as well? Also, I’m not aware of any other DB changes slated for 4.20.2. If this is the only one, would it be safer to target 4.22? Not blocking—just a suggestion.

Comment thread server/src/main/java/com/cloud/server/ManagementServerImpl.java Outdated
@winterhazel
Member Author

> @winterhazel could we extend this to cover archive events as well? Also, I’m not aware of any other DB changes slated for 4.20.2. If this is the only one, would it be safer to target 4.22? Not blocking—just a suggestion.

@shwstppr I'll have a look into whether event archiving needs or would benefit from any changes.

About targeting 4.22 instead: the only DB change here is the addition of an index. As far as I know, this cannot cause any inconsistency, so it should be safe going to 4.20.2.

@weizhouapache weizhouapache modified the milestones: 4.20.2, 4.22.0 Sep 11, 2025
@weizhouapache
Member

Moving to 4.22 milestone as it has some DB changes
cc @winterhazel @harikrishna-patnala

@winterhazel winterhazel marked this pull request as draft September 23, 2025 21:46
@harikrishna-patnala harikrishna-patnala modified the milestones: 4.22.0, 4.22.1 Nov 12, 2025
@winterhazel winterhazel force-pushed the listevents-optimizations branch from 0a8e293 to 73f79b8 Compare December 28, 2025 17:26
@winterhazel winterhazel changed the base branch from 4.20 to main December 28, 2025 17:26
@winterhazel winterhazel marked this pull request as ready for review December 28, 2025 17:27
@winterhazel winterhazel force-pushed the listevents-optimizations branch from 73f79b8 to f52a619 Compare December 28, 2025 17:30
@blueorangutan

[SF] Trillian test result (tid-15078)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 56017 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11443-t15078-kvm-ol8.zip
Smoke tests completed. 148 look OK, 2 have errors, 0 did not run
Only failed and skipped tests results shown below:

| Test | Result | Time (s) | Test File |
|---|---|---|---|
| test_03_deploy_and_scale_kubernetes_cluster | Failure | 27.90 | test_kubernetes_clusters.py |
| test_02_isolate_network_FW_PF_default_routes_egress_false | Failure | 120.40 | test_routers_network_ops.py |

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@winterhazel
Member Author

@blueorangutan package

@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16622

@github-actions

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

@winterhazel
Member Author

@blueorangutan package

@blueorangutan

@winterhazel a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17555

@DaanHoogland
Contributor

@blueorangutan test

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.



Comment on lines +1246 to +1249
```diff
-        if (ids != null && events.size() < ids.size()) {
-            return false;
-        }
-        _eventDao.archiveEvents(events);
-        return result;
+        long totalArchived = _eventDao.archiveEvents(ids, type, startDate, endDate, accountId, domainIds,
+                ConfigurationManagerImpl.DELETE_QUERY_BATCH_SIZE.value());
+        return totalArchived > 0;
```
Copilot AI Apr 28, 2026

When ids is provided, this method reports success if any matching rows were archived. That means a request like ids=[1,2,3] can return success even if only a subset was actually archived (e.g., some IDs don’t exist or aren’t in the caller’s scope). If the API is expected to be all-or-nothing for explicit IDs, consider returning failure (or throwing) when totalArchived != ids.size() when ids is non-empty, or otherwise make the partial-success behavior explicit.

Member Author

@winterhazel winterhazel Apr 30, 2026


This was the previous behavior. I do not want to change the response of the API in this patch, just optimize its operations. Edit: there was a change. I will adjust.

Comment on lines 81 to +101
```diff
 @Override
-public void archiveEvents(List<EventVO> events) {
-    if (events != null && !events.isEmpty()) {
-        TransactionLegacy txn = TransactionLegacy.currentTxn();
-        txn.start();
-        for (EventVO event : events) {
-            event = lockRow(event.getId(), true);
-            event.setArchived(true);
-            update(event.getId(), event);
-            txn.commit();
-        txn.close();
+public long archiveEvents(List<Long> ids, String type, Date startDate, Date endDate, Long accountId, List<Long> domainIds,
+        long limitPerQuery) {
+    SearchCriteria<EventVO> sc = createEventSearchCriteria(ids, type, startDate, endDate, null, accountId, domainIds);
+    Filter filter = null;
+    if (limitPerQuery > 0) {
+        filter = new Filter(limitPerQuery);
+    }
+
+    long archived;
+    long totalArchived = 0L;
+
+    do {
+        List<EventVO> events = search(sc, filter);
+        if (events.isEmpty()) {
+            break;
+        }
+        archived = archiveEventsInternal(events);
+        totalArchived += archived;
+    } while (limitPerQuery > 0 && archived >= limitPerQuery);
```
Copilot AI Apr 28, 2026

archiveEvents can load all matching events into memory when limitPerQuery <= 0 (the default for delete.query.batch.size is 0), then builds a single UPDATE ... WHERE id IN (...) statement containing every ID. On large event tables this risks OOMs, oversized SQL statements, and very long-running queries. Consider always archiving in batches (even when the config is 0), or implement a batched UPDATE ... LIMIT <batch> loop similar to batchExpunge so the operation remains bounded.
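
The bounded loop suggested here could look roughly like this. This is a sketch assuming MySQL and a batch size of 1000, not code from the PR:

```sql
-- Repeat until ROW_COUNT() reports fewer than 1000 affected rows, so the
-- archive operation stays bounded regardless of how many events match.
UPDATE event
SET archived = 1
WHERE archived = 0 AND created < '2025-01-01'
LIMIT 1000;
```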

Member Author

Ignoring the configuration and doing it in batches anyway would make having a configuration weird. I think that we can just enable batch operations by default instead.

Comment on lines 81 to +126
```diff
 @Override
-public void archiveEvents(List<EventVO> events) {
-    if (events != null && !events.isEmpty()) {
-        TransactionLegacy txn = TransactionLegacy.currentTxn();
-        txn.start();
-        for (EventVO event : events) {
-            event = lockRow(event.getId(), true);
-            event.setArchived(true);
-            update(event.getId(), event);
-            txn.commit();
-        txn.close();
+public long archiveEvents(List<Long> ids, String type, Date startDate, Date endDate, Long accountId, List<Long> domainIds,
+        long limitPerQuery) {
+    SearchCriteria<EventVO> sc = createEventSearchCriteria(ids, type, startDate, endDate, null, accountId, domainIds);
+    Filter filter = null;
+    if (limitPerQuery > 0) {
+        filter = new Filter(limitPerQuery);
+    }
+
+    long archived;
+    long totalArchived = 0L;
+
+    do {
+        List<EventVO> events = search(sc, filter);
+        if (events.isEmpty()) {
+            break;
+        }
+        archived = archiveEventsInternal(events);
+        totalArchived += archived;
+    } while (limitPerQuery > 0 && archived >= limitPerQuery);
+
+    return totalArchived;
+}
+
+@DB
+private long archiveEventsInternal(List<EventVO> events) {
+    final String idsAsString = events.stream()
+            .map(e -> Long.toString(e.getId()))
+            .collect(Collectors.joining(","));
+    final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", idsAsString);
+
+    try (TransactionLegacy txn = TransactionLegacy.currentTxn();
+            PreparedStatement pstmt = txn.prepareStatement(query)) {
+        return pstmt.executeUpdate();
+    } catch (SQLException e) {
+        throw new CloudRuntimeException(e);
+    }
+}
+
+@Override
+public long purgeAll(List<Long> ids, Date startDate, Date endDate, Date limitDate, String type, Long accountId,
+        List<Long> domainIds, long limitPerQuery) {
+    SearchCriteria<EventVO> sc = createEventSearchCriteria(ids, type, startDate, endDate, limitDate, accountId, domainIds);
+    return batchExpunge(sc, limitPerQuery);
+}
```
Copilot AI Apr 28, 2026

New batching semantics in archiveEvents(...)/purgeAll(...) are non-trivial (limit handling, iteration/termination, criteria composition) but there are currently no unit tests covering these paths (unlike other DAO batch-expunge usages in engine/schema/src/test). Consider adding a focused DAO unit test validating that multiple batches are processed correctly and that limitPerQuery <= 0 behaves as intended.

Comment thread server/src/main/java/com/cloud/server/ManagementServerImpl.java Outdated
```diff
         _accountMgr.checkAccess(CallContext.current().getCallingAccount(), null, false, sameOwnerEvents);
+        long totalRemoved = _eventDao.purgeAll(ids, startDate, endDate, null, type, accountId, domainIds,
+                ConfigurationManagerImpl.DELETE_QUERY_BATCH_SIZE.value());
```
Copilot AI Apr 28, 2026

When ids is provided, this method reports success if any rows were deleted. This can silently ignore IDs that don’t match the criteria / caller scope while still returning success, which is potentially misleading for callers expecting all requested IDs to be removed. Consider aligning the return semantics with the requested operation (e.g., fail when ids is non-empty and totalRemoved != ids.size(), or explicitly document/return partial results).

Suggested change:

```java
if (CollectionUtils.isNotEmpty(ids)) {
    return totalRemoved == ids.size();
}
```

Comment thread engine/schema/src/main/java/com/cloud/event/dao/EventDaoImpl.java
Comment on lines +108 to +114
```java
final String idsAsString = events.stream()
        .map(e -> Long.toString(e.getId()))
        .collect(Collectors.joining(","));
final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", idsAsString);

try (TransactionLegacy txn = TransactionLegacy.currentTxn();
        PreparedStatement pstmt = txn.prepareStatement(query)) {
```
Copilot AI Apr 28, 2026

archiveEventsInternal builds SQL by string-concatenating the IDs into an IN (...) clause. Besides the query-size risk, it bypasses parameter binding and can stress SQL parsing/plan caching. Consider using a parameterized statement (placeholders) or a different batching strategy (e.g., update by criteria with LIMIT) to keep statements small and reusable.

Suggested change:

```diff
-final String idsAsString = events.stream()
-        .map(e -> Long.toString(e.getId()))
-        .collect(Collectors.joining(","));
-final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", idsAsString);
-try (TransactionLegacy txn = TransactionLegacy.currentTxn();
-        PreparedStatement pstmt = txn.prepareStatement(query)) {
+if (CollectionUtils.isEmpty(events)) {
+    return 0L;
+}
+final String placeholders = events.stream()
+        .map(event -> "?")
+        .collect(Collectors.joining(","));
+final String query = String.format("UPDATE event SET archived=true WHERE id IN (%s)", placeholders);
+try (TransactionLegacy txn = TransactionLegacy.currentTxn();
+        PreparedStatement pstmt = txn.prepareStatement(query)) {
+    for (int i = 0; i < events.size(); i++) {
+        pstmt.setLong(i + 1, events.get(i).getId());
+    }
```

Member Author

This suggestion does not make sense to me

@blueorangutan

[SF] Trillian test result (tid-15961)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 50779 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11443-t15961-kvm-ol8.zip
Smoke tests completed. 151 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

(none)

Contributor

@DaanHoogland DaanHoogland left a comment


code looks good, two questions

  • can we expect / catch any runtime exceptions in the command objects (ArchiveEvents and DeleteEvents)?
  • will further testing (beyond unit testing) be needed?

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@winterhazel
Member Author

  • can we expect / catch any runtime exceptions in the command objects (ArchiveEvents and DeleteEvents)?

@DaanHoogland InvalidParameterValueException can be thrown depending on the parameters, and CloudRuntimeException if there is a database-related error. I don't think there's any need to catch those in the command though.

  • will further testing (beyond unit testing) be needed?

Some basic manual testing would be nice.

@DaanHoogland
Contributor

@blueorangutan test keepEnv

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-15998)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 46856 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr11443-t15998-kvm-ol8.zip
Smoke tests completed. 151 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

(none)

@DaanHoogland DaanHoogland self-assigned this May 2, 2026
